ABSTRACT

This paper presents a novel approach for using clickthrough 
data to learn ranked retrieval functions for web search re- 
sults. We observe that users searching the web often perform 
a sequence, or chain, of queries with a similar information 
need. Using query chains, we generate new types of prefer- 
ence judgments from search engine logs, thus taking advan- 
tage of user intelligence in reformulating queries. To validate 
our method we perform a controlled user study comparing 
generated preference judgments to explicit relevance judgments.
We also implemented a real-world search engine to
test our approach, using a modified ranking SVM to learn 
an improved ranking function from preference data. Our 
results demonstrate significant improvements in the ranking 
given by the search engine. The learned rankings outperform
both a static ranking function and one trained
without considering query chains.

Categories and Subject Descriptors 

H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval 

General Terms 

Algorithms, Experimentation, Measurement 

Keywords 

Search Engines, Implicit Feedback, Machine Learning, Sup- 
port Vector Machines, Clickthrough Data 

1. INTRODUCTION

Designing effective ranking functions for free text retrieval 
has proved notoriously difficult. Retrieval functions designed 
for one collection and application often do not work well on 
other collections without additional time-consuming modifications.
This has led to interest in using machine learning
methods for automatically learning ranked retrieval func- 
tions. 
For this learning task, training data can be collected in
two ways. One approach relies on actively soliciting training 
data by recording user queries and then asking users to ex- 
plicitly provide relevance judgments on retrieved documents 
(such as [71 1131 1^] ). Few users are willing to do this, making 
significant amounts of such data difficult to obtain. An al- 
ternative approach is to extract implicit relevance feedback 
from search engine log files (such as in |?)||15|L This allows 
virtually unlimited data to be collected at very low cost, 
although interpretation is more complex. 

Irrespective of the approach, to the best of our knowledge 
all previous research in learning retrieval functions has con- 
sidered each query independently. We will show that this 
ignores valuable information that is hidden in the sequence 
of queries and clicks in a search session. For instance, if we 
repeatedly observe the query "special collections" followed 
by another for "rare books" on a library search system, we 
may deduce that web pages relevant to the second query may 
also be relevant to the first. Additionally, this log information
can allow us to learn to correct spelling mistakes in
a similar way. For example, we observed that users search- 
ing for the "Lexis Nexis" repository often first search for 
"Lexis Nexus" by mistake. 

As users search, it is well documented that they often 
reformulate their queries |3l 181 fT%l 120) . Previous work has 
attempted to predict query reformulations, but to the best 
of our knowledge these reformulations have never been used 
to learn better retrieval functions. In this paper, we refer to 
a sequence of reformulated queries as a query chain. When 
queries are considered independently, log files only provide 
implicit feedback on a few results at the top of the result 
set for each query because users very rarely look further 
down the list. The advantage of using query chains is that 
we can also deduce relevance judgments on the many more 
documents seen during an entire search session. 

The key contribution of this work is recognizing that we 
can successfully use evidence of query chains that is present 
in search engine log files to learn better retrieval functions. 
We demonstrate a simple method for automatically detect- 
ing query chains in query and clickthrough logs. Using 
this data, we show how to infer preference judgments as to 
the relative relevance of documents both within individual 
query results, and between documents returned by different 
queries within the same query chain. The method used to 
generate the preference judgments is validated using a con- 
trolled user study. We then adapt a ranking SVM to learn 
a ranked retrieval function from the preference judgments. 
In doing so, we propose a general retrieval model that can
learn to associate individual documents with specific query
words, even if the words do not occur in the documents. 
This differs from previous learned ranked retrieval functions 
in that our method can learn a much more general class of 
functions. 

We demonstrate the effectiveness of our approach on a 
real-world web search system, the Cornell University library 1 
web search. We name our implementation the Osmot search 
engine, and it is available for download to the research com- 
munity. The name is derived from the word osmosis, as 
learning from implicit feedback is, in our opinion, almost as 
good as learning from users by osmosis. 

2. RELATED WORK 

When learning to rank, the method by which training 
data is collected offers an important way to distinguish be- 
tween different approaches. This data usually consists of a 
set of statements as to the relevance of a document, or set 
of documents, to a given query. Such relevance judgments 
are either collected explicitly by asking users, or implicitly 
by observing user behavior and drawing conclusions. More- 
over, the statements can be absolute or relative. Absolute 
feedback involves statements that a particular document is, 
or is not, relevant to a query. Relative feedback involves 
statements that a particular document is more relevant to a 
query than some other document. 

Most previous work in learning to rank has assumed ab- 
solute relevance judgments. On the one hand, a number of 
methods in ordinal regression use explicit feedback to learn 
to rank, such as work by Crammer and Singer [7], Rajaram
et al. [22] and Herbrich et al. [13]. However, explicit feedback
is expensive to collect, with few users willing to spend
the additional time to provide it in a real-world setting. This 
makes typical labeled data sets small and difficult to work 
with. A number of researchers have collected absolute rele- 
vance judgments implicitly from clickthrough logs, such as 
|H Ibl [T9l 125| . They postulate that documents clicked on in 
search results are highly likely to be relevant. For example, 
Kemp et al. [19] present a learning search engine using document
transformation. They assume results clicked on are
relevant to the query and append the query to these docu- 
ments. However, implicit clickthrough data has been shown 
to be biased as it is relative to the retrieval function quality
and ordering [15, 17]. This makes its interpretation as
absolute feedback of questionable accuracy. 

Cohen et al. |H] and Freund et al. propose using 
log data to generate relative preference feedback. Both ap- 
proaches consider learning a ranking function from these 
preference judgments, along similar lines as this work. How- 
ever, in contrast to our method their learned function is 
limited to a combination of rankings given by a fixed set of 
manually constructed "experts". This approach of learning
a combination of functions is also used by most other work 
in this area [Tl 121 HI PHA I5T| . 

Joachims [15] refined the interpretation of clickthrough log
data as relative feedback. He suggests that given a ranking 
and a clicked-on document d, any document ranked above d 
but not clicked on is likely less relevant than d. In this paper, 
we evaluate the validity of this construction, and extend it to 
query chains. We also use a more general ranking function 
and extend the learning algorithm to query chains. 

1 http://library.cornell.edu/

Figure 1: Percentage of time an abstract was
viewed/clicked on depending on the rank of the result.

Figure 2: Mean number of abstracts viewed above
and below a clicked link depending on its rank. 

An important innovation in this paper is that we learn a 
more general ranking function than previous work by asso- 
ciating query words with specific documents. This approach 
has been used previously to learn to generate abstracts [23]
and in document transformation [19], but not to learn ranking
functions. Prior approaches cannot learn to associate
"new" documents with a given query because they combine 
or re-order results obtained from one or more static ranking 
functions. In particular, given a query q, they cannot learn 
to retrieve any document not originally returned by q. Coming
closest to solving this limitation previously, the method
presented by Kemp et al. [19] could be extended with query
chains. However, they assume implicit absolute feedback, 
making their approach more likely to be susceptible to bias 
and noise. 

3. ANALYSIS OF USER BEHAVIOR 

In order to infer implicit preference judgments from log 
files, we need to understand how users assess search results. 
Clearly we can only derive valid feedback for results that 
the user actually looked at and assessed. In this section we 
explore this question. 

An eye tracking study was performed to observe how users 
formulate queries, assess the results returned by the search 
engine and select the links they click on [11, 12]. Thirty-six
undergraduate student volunteers were instructed to search 
for the answers to five navigational and five informational 
queries pp. The former involved finding a specific web page 
while the latter involved finding some specific information. 
The subjects were asked to start from the Google search 
page and find the answers. There were no restrictions on 
what queries they may choose, how and when to reformu- 
late queries, or which links to follow. Users were told that 
the goal of the study was to observe how people search the
Web, but were not told of the specific interest in their behavior
on the results page of Google. All clicks, the results
returned by Google, and the pages connected to the results
were recorded by an HTTP proxy. Movement of the eyes was
recorded using an ASL 504 commercial eye tracker (Applied
Science Technologies, Bedford, MA). More details on the
experimental setup are provided in [12].

Query 1: NDLF

1. http://.../staffweb/SMG/SMG970319.html
2. http://.../staffweb/SMG/SMG970226.html
3. http://.../staffweb/SMG/SMG960417.html
4. http://.../staffweb/SMG/SMG960403.html
5. http://.../staffweb/SMG/SMG960828.html

Query 2: "Ezra Cornell" residence

1. Dear Uncle Ezra - Questions for Tuesday, May...
2. Dear Uncle Ezra - Questions for Thursday,...
3. Ezra Cornell had close Albion ties
4. October 1904 - Albion 100 Years Ago
5. Cornell competes with Off-Housing market

Figure 3: Two example queries and result sets.

Figure 1 shows the fraction of the time users looked at,
and clicked on, each of the top 10 search results for a query. 
It tells us that users usually look at least at the top two result 
abstracts. Interestingly, note that despite the top two doc- 
uments receiving almost equal attention, users were much 
more likely to click on the first result. Figure 2 (adapted
from Figure 2 in [12]) shows the number of abstracts viewed
above and below any result that was clicked on. This fig- 
ure tells us that users usually scan the results in order from 
top to bottom. We also see that users usually look at one 
abstract below any they click on. Further analysis showed 
that this is usually the abstract immediately below the one 
clicked on [17]. We conclude that users typically look at
most of the results from the first to the one below the last 
one clicked on. 
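
The conclusion above can be expressed as a simple rule for deciding which results to treat as examined. The following is a minimal sketch only: the function name `examined_ranks`, the 1-based rank convention, and the fallback to the top two results when nothing was clicked are illustrative assumptions based on the observations in this section, not code from the study.

```python
def examined_ranks(clicked_ranks, num_results):
    """Return the 1-based ranks assumed to have been examined.

    Assumption: the user scans from rank 1 down to one position below
    the lowest-ranked click; with no clicks, the top two are examined.
    """
    if num_results <= 0:
        return []
    if clicked_ranks:
        last = max(clicked_ranks) + 1  # one abstract below the last click
    else:
        last = 2  # users usually view at least the top two abstracts
    return list(range(1, min(last, num_results) + 1))


if __name__ == "__main__":
    print(examined_ranks([1, 3], 10))
    print(examined_ranks([], 10))
```

Only results inside this examined range are candidates for preference feedback, since a user cannot have assessed a result she did not look at.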

Previous work studying web search behavior [20, 24] observed
that users rarely run only a single query and immediately
find suitable results. Rather, they tend to perform
a sequence of queries for any given question. Such query 
chains are also observed in the eye tracking study. The mean 
query chain length was 2.2 queries, although the particular 
questions asked and the laboratory environment would be 
expected to have an influence on this value. A number of 
papers (e.g. [10, 18]) successfully learn to predict query
reformulations. Their success on this task suggests that the 
problem of detecting query chains, which we will have to 
address, is feasible. 

4. FROM LOG FILES TO FEEDBACK 

This section details our approach for generating relative 
preference feedback from query and clickthrough logs as im- 
plemented in the Osmot search engine. We then present 
an evaluation of this approach using results from the eye 
tracking study. 

Consider the queries shown in Figure 3 as examples we use
to demonstrate the value of query chains. The first shows 
the results presented to a user running the query "NDLF" 
on the Cornell University library search page. The user is
searching for the National Digital Library Foundation website,
but has retrieved only meeting notes that reference people
working for the NDLF. The desired page is not in these
results, most probably because it does not contain the word
"NDLF". The second query is a search performed in Google
by a participant in the eye tracking study attempting
to find the name of the house that Ezra Cornell built for
himself. We get many results, but in fact none of the top
10 contain any relevant information. In both cases, single
query feedback will not be informative because no relevant
documents were retrieved. In the former case, the results
simply do not contain any documents relevant to the query.
In the latter, if there is a relevant document it is unlikely
the user will look far enough in the results to see it.

Figure 4: Feedback strategies. We either consider a
single query, q, or a query q that has been preceded
by a query q'. Given a query, a dot represents a
result document and an x indicates the result was
clicked on. We generate a constraint for each arrow
shown, with respect to the query marked.

On the other hand, after both of these queries, we ob- 
served that the user continued running other queries. Often, 
such later queries are more successful. If a user finds a rele- 
vant document with a later query, it is reasonable to assume 
that the user would have preferred to have seen the relevant 
document over the results actually returned earlier. Recog- 
nizing the information necessary to make these deductions 
is present in search engine log files, we now propose spe- 
cific strategies for generating such preference feedback from 
query chains. We defer a discussion of how to group queries 
into query chains to Section 6.

4.1 Implicit Feedback Strategies 

We generate preference feedback using six strategies. These 
strategies are illustrated in Figure 4. The first two strategies
show preferences that can be inferred without query chains.

Figure 5: Sample query chain and the feedback that
would be generated using all six feedback strategies.
Two queries were run, and each returned three documents.
One document in each query was clicked
on. di > q dj means that di is preferred over dj with
respect to the query q.



The first one, "Click > q Skip Above" was proposed in [HJ 
I15| . This strategy proposes that given a clicked-on docu- 
ment (marked x in the figure) , any higher ranked document 
that was not clicked on is likely less relevant. The preference 
is indicated by an arrow labeled with the query, to show 
that the preference is with respect to that query. We ex- 
pect this to be valid because the eye tracking study showed 
that users view results in order, and a user is unlikely to 
click on a document she considers less relevant than another 
document she observed. Note that these preferences are not 
stating that the clicked-on document is relevant, rather that 
it is more likely to be relevant than the ones not clicked on 
above. The second strategy, "Click First > q No-Click Sec- 
ond" makes use of the fact that users typically view both 
of the top two results before clicking. It states that if the 
first document is clicked on, but the second is not, the first 
is likely more relevant than the second. It seems reasonable 
to assume that having considered two options, the user is 
likely to click on the more relevant one. 
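
The two within-query strategies described above can be sketched as follows. This is an illustrative sketch, not the Osmot implementation: the function names, the representation of a result list as a ranked Python list of document ids, and the `(preferred, less_preferred, query)` triple format are all assumptions made here.

```python
def click_skip_above(results, clicked, query):
    """"Click > q Skip Above": each clicked document is preferred over
    every higher-ranked document that was not clicked."""
    prefs = []
    for pos, doc in enumerate(results):
        if doc in clicked:
            for above in results[:pos]:
                if above not in clicked:
                    prefs.append((doc, above, query))
    return prefs


def click_first_no_click_second(results, clicked, query):
    """"Click First > q No-Click Second": if the first result was
    clicked and the second was not, prefer the first over the second."""
    if len(results) >= 2 and results[0] in clicked and results[1] not in clicked:
        return [(results[0], results[1], query)]
    return []


if __name__ == "__main__":
    ranked = ["d1", "d2", "d3"]
    print(click_skip_above(ranked, {"d3"}, "q"))
    print(click_first_no_click_second(ranked, {"d1"}, "q"))
```

Passing the earlier query of a chain as the `query` argument, while taking `results` and `clicked` from the later query, gives the corresponding preferences with respect to the previous query.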

The next two strategies are identical to the first two except 
that they generate feedback with respect to the previous 
query. The intuition behind this is that since the two queries 
belong to the same query chain, the user is looking for the 
same information with both. Had the user been presented 
with the new results for the earlier query, she would have 
preferred the clicked-on document over those skipped above. 

The last two strategies make the most use of query chains.
The strategy "Click > q' Skip Earlier Query" states that a
clicked-on document is preferred over any result not clicked
on in an earlier query q' (within the same query chain).
This judgment is made with respect to the earlier query
q' (see footnote 2). Since the eye tracking study revealed that
users usually look one document past the last one clicked on,
we also prefer the clicked-on document over this next document.
In the event that no documents were clicked on in the earlier
query, we use the fact that users usually look at the top two
results. This is exploited in the feedback strategy "Click > q' Top
Two Earlier Query" by preferring the clicked-on document over
the top two results of the earlier query. In the unusual case
where there are not enough results for the earlier query to use
these strategies, we select a random document as if it had been
at the end of the results.
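
The two query-chain strategies might be sketched as below. The function names and data layout are hypothetical. The sketch reads "not clicked on in an earlier query" as the unclicked results down to one position below the last click, which is an interpretation of the text above, and it omits the random-document fallback for short result lists.

```python
def click_skip_earlier_query(earlier_results, earlier_clicked,
                             later_clicked, earlier_query):
    """"Click > q' Skip Earlier Query": documents clicked in the later
    query are preferred over examined-but-unclicked results of the
    earlier query, with respect to the earlier query."""
    positions = [i for i, d in enumerate(earlier_results) if d in earlier_clicked]
    if not positions:
        return []  # strategy applies only if the earlier query had a click
    seen = earlier_results[: max(positions) + 2]  # include one past last click
    skipped = [d for d in seen if d not in earlier_clicked]
    return [(c, s, earlier_query) for c in later_clicked for s in skipped if c != s]


def click_top_two_earlier_query(earlier_results, earlier_clicked,
                                later_clicked, earlier_query):
    """"Click > q' Top Two Earlier Query": if the earlier query had no
    clicks, prefer later clicked documents over its top two results."""
    if any(d in earlier_clicked for d in earlier_results):
        return []
    return [(c, s, earlier_query)
            for c in later_clicked for s in earlier_results[:2] if c != s]


if __name__ == "__main__":
    print(click_skip_earlier_query(["a", "b", "c"], {"b"}, ["x"], "q1"))
    print(click_top_two_earlier_query(["a", "b", "c"], set(), ["x"], "q1"))
```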

Ultimately, given some query chain, we make use of all 
six strategies to generate the preference feedback. Figure 5
gives a sample query chain and the feedback that would be
generated in this case.

2 It is unnecessary to state the same thing with respect to the
later query q because presumably the preference is already
satisfied, or the user would have seen the same result earlier.

Strategy                            Accuracy (%)
Click > q Skip Above                78.2 ± 5.6
Click First > q No-Click Second     63.4 ± 16.5
Click > q' Skip Earlier Query       68.0 ± 8.4
Click > q' Top Two Earlier Query    84.5 ± 6.1
Inter-Judge Agreement               86.4

Table 1: Accuracy of the strategies for generating
pairwise preferences from clicks. The basis of comparison
is the explicit page judgments. Note that
the first two cases cover two preference strategies
each.

4.2 Accuracy of Feedback Strategies 

While the feedback strategies proposed above are intu- 
itively appealing, a quantitative evaluation is necessary to 
establish their degree of validity. To determine the accu- 
racy of each individual strategy, we conducted a controlled 
experiment following the setup of the eye-tracking study described
in Section 3 for an additional 16 subjects. For these
subjects, we evaluated how far the preferences derived
from the feedback strategies agree with explicit relevance 
judgments made by independent judges. 

For these 16 subjects, we collected all results and their 
associated web pages returned by Google from the HTTP- 
proxy cache that recorded their sessions. We grouped the 
results by query chain and subject and collected explicit rele- 
vance judgments using five judges. The judges were asked to 
weakly order all results encountered during each query chain 
according to their relevance to the question. To avoid bias- 
ing the judges, the order in which results were presented to 
the judges was randomized and the judges were not given the 
abstracts Google used when presenting the results. Some of 
the query chains were assessed by two judges for inter-judge 
agreement verification. The agreement between judges is 
reasonably high. Whenever two judges expressed a strict 
preference between two pages, they agree in the direction of 
preference in 86.4% of the cases. 

We now evaluate the extent to which the preferences generated
from clicks agree with the explicit judgments. Table 1
summarizes the results. The table shows the percentage
of times the preferences generated from clicks using the
above strategies agree with the direction of a strict prefer- 
ence of a relevance judge. The first two lines in the table 
show the accuracy of the strategies that do not exploit query 
chains. The "Click > q Skip Above" strategy is 78.2% accu- 
rate, which is substantially and significantly better than the 
random baseline of 50%. Furthermore, it is reasonably close 
in accuracy to the average agreement of 86.4% between the 
explicit judgments from different judges, which can serve as 
an upper bound for the accuracy one could ideally expect 
even from explicit user feedback. The second within-query 
strategy, "Click First > q No-Click Second", appears less ac- 
curate. However, since it produces fewer preferences (i.e. 
only on queries where the user clicked exclusively on the 
first link), the confidence intervals are large. Independent 
of the accuracy, the preferences from this strategy are prob- 
ably less informative, since they only confirm the current 
ranking and never suggest a reordering. 
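
The accuracy measure used here, agreement with the direction of a judge's strict preference, can be sketched as follows. The function name and the encoding of a judge's weak ordering as a dictionary from document to relevance level (lower is more relevant, ties allowed) are assumptions made for illustration.

```python
def agreement_with_judge(preferences, judge_level):
    """Fraction of click-derived preferences that agree with a judge.

    preferences: iterable of (preferred, less_preferred) document pairs.
    judge_level: dict mapping document -> level in the judge's weak
                 ordering (lower means more relevant).
    Pairs the judge tied, or did not assess, are ignored, so only strict
    preferences count. Returns None if no pair is comparable.
    """
    agree = total = 0
    for better, worse in preferences:
        if better not in judge_level or worse not in judge_level:
            continue
        if judge_level[better] == judge_level[worse]:
            continue  # judge expressed no strict preference
        total += 1
        if judge_level[better] < judge_level[worse]:
            agree += 1
    return agree / total if total else None


if __name__ == "__main__":
    prefs = [("a", "b"), ("c", "a"), ("b", "c")]
    print(agreement_with_judge(prefs, {"a": 1, "b": 2, "c": 2}))
```

Under this measure a strategy that guesses directions at random would score about 50%, which is the baseline quoted above.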



CosineDistance(q1, q2)
CosineDistance(doc ids of r1', doc ids of r2')
CosineDistance(abstracts of r1', abstracts of r2')
TrigramMatch(q1, q2)
ShareOneWord(q1, q2)
ShareTwoWords(q1, q2)
SharePhraseOfTwoWords(q1, q2)
NumberOfDifferentWords(q1, q2)
t2 - t1 < {5, 10, 30, 100} seconds
t2 - t1 > 100 seconds
NormalizedNumberOfClicks(r1)
NormalizedMin(|r1|, |r2|)
NormalizedMax(|r1|, |r2|)



Lines 3 and 4 in Table 1 show the accuracy of the two
strategies that exploit query chains. Both "Click > q' Skip
Earlier Query" and "Click > q' Top Two Earlier Query" are
significantly more accurate than random. In particular, the
accuracy of "Click > q' Top Two Earlier Query" is very close
to the average agreement between judges. Note that this
strategy produces particularly informative preferences, since
it associates documents with query words that may not occur
in the document.

A possible explanation for the difference in accuracy between
the two query-chain strategies is that they apply to
different types of query chains. While "Click > q' Skip Earlier
Query" is applied when the previous query received a
click, the strategy "Click > q' Top Two Earlier Query" is
applied precisely in the opposite case. To investigate the effect
of this difference, we also evaluated a variant of "Click
> q' Top Two Earlier Query". This variant generates preferences
analogous to "Click > q' Top Two Earlier Query", but
in chains where the previous query did receive a click (but
excluding the clicked results). The accuracy of this strategy
is 67.7% ± 9.4, indicating that the absence of a click followed
by another query with a click is particularly strong evidence
regarding the relevance of the results of the earlier query.

Overall, we conclude that the preferences generated from 
the clickthrough logs are reasonably accurate and that they 
convey information regarding the user's preferences. 

5. EVALUATION ENVIRONMENT 

While the previous section showed that the preferences 
generated from log files are accurate, can they be used to
learn an improved retrieval system? 

To address this question, we constructed a publicly accessible
real-world search engine. The search engine implements
a full-text search of web pages maintained by the
Cornell University library 1 (CUL). This collection includes 
over 13,500 web pages. We used the Nutch search engine 3 
as a starting point, with the Osmot search engine effectively 
being a wrapper around Nutch that implements logging, log 
analysis, learning, reranking and evaluation functionality. 
Osmot is designed to allow any number of different rank- 
ing functions to be plugged into it. In the experiments in 
this paper, we chose Nutch's built-in retrieval function as 
the baseline to compare against and build upon. The Nutch 
retrieval function is based on the cosine distance and incor- 
porates several modifications to make it more suitable for 
web search including special cases for phrase matches and 
HTML fields. 

6. DETECTING QUERY CHAINS 

In order to use query chains, we must first have a method 
to identify them. In this section we propose such a heuristic 
and demonstrate its effectiveness. 

As a basis for our evaluation, we created a dataset us- 
ing search logs from the CUL search engine. We manually 
labeled query chains in the logs for a period of 5 weeks. 
The search logs recorded the query, date, IP address, re- 
sults returned, number of clicks on the results and a session 
id uniquely assigned to each user. We extracted the list 
of queries, grouped them by IP address and sorted them 
chronologically. Queries from an IP address with no other 
queries within 24 hours were automatically marked as not
belonging to a query chain. This resulted in 1285 queries.
Two judges (the authors of this paper) then individually
grouped the queries into query chains manually, using search
engines to resolve uncertainties (such as a query for a person
followed by one for the department where the person is a faculty
member). Finally, the judges combined their identified
query chains, resolving the small number of disagreements
between themselves through further investigation.

3 http://www.nutch.org/

Table 2: Features used to learn to classify query
chains. q1 and q2 are two queries at times t1 and t2,
with t1 < t2. r1 and r2 are the respective result sets,
with r1' and r2' being the top 10 results.

For each pair of queries from the same IP address within 
half an hour, we generated a training example by construct- 
ing a feature vector. The training example was labeled using 
the query chains identified manually. If the two queries be- 
longed to the same query chain the example was labeled as 
positive. Otherwise it was labeled as negative. This led to 
3418 training examples of which 3096 were labeled as positive.
The feature vector generated given two queries q1 and
q2 consisted of the 16 features shown in Table 2.
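
To make the feature vector more concrete, the sketch below computes a subset of the query-pair features. The exact feature definitions are not spelled out in the text, so everything here is an assumption: queries are split on whitespace, the cosine feature is computed as cosine similarity between bag-of-words count vectors, and the dictionary keys are hypothetical names. Features that depend on result sets and abstracts are omitted.

```python
import math
from collections import Counter


def cosine(a_tokens, b_tokens):
    """Cosine similarity between two bag-of-words count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def pair_features(q1, q2, t1, t2):
    """Features for a query pair issued at times t1 < t2 (in seconds)."""
    w1, w2 = q1.lower().split(), q2.lower().split()
    shared = set(w1) & set(w2)
    dt = t2 - t1
    features = {
        "cosine_queries": cosine(w1, w2),
        "share_one_word": float(len(shared) >= 1),
        "share_two_words": float(len(shared) >= 2),
        "num_different_words": float(len(set(w1) ^ set(w2))),
        "dt_gt_100": float(dt > 100),
    }
    # One binary feature per time threshold, as in "t2 - t1 < {5, 10, 30, 100}".
    for seconds in (5, 10, 30, 100):
        features[f"dt_lt_{seconds}"] = float(dt < seconds)
    return features


if __name__ == "__main__":
    print(pair_features("special collections", "rare books special", 0, 20))
```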

Using this data, we trained a number of SVM classi- 
fiers with various parameters. The classifiers learned tended 
to label almost all examples as positive. Among our best 
performing models was an SVM with an RBF kernel with 
C = 100 and γ = 1. Evaluating using five-fold cross-validation,
it gave an average accuracy of 94.3% and precision of
96.5%. This compares to an accuracy and precision of 91.6%
for a simple non-learning strategy where we assume all pairs
of queries from the same IP address within half an hour of
each other are in the same query chain. As this difference is
relatively small, and computing this feature vector for every
query pair is relatively expensive (in particular since it depends
on the abstracts retrieved), we decided to rely simply
on our heuristic measure. We judged that a precision of over 
90% is sufficient for our present purposes. We considered ex- 
tending the half-hour window on our training data in order 
to increase the recall, but decided that we were recognizing 
a sufficient number of query chains without doing so. 
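
The non-learning heuristic can be sketched as a single pass over the log. This is a sketch under stated assumptions: the log is taken to be a sequence of `(ip, timestamp_in_seconds, query)` tuples, and "within half an hour of each other" is applied to consecutive queries from the same IP address, so a chain is split wherever the gap exceeds the window. The function and constant names are hypothetical.

```python
from collections import defaultdict

HALF_HOUR = 30 * 60  # window length in seconds


def segment_query_chains(log, window=HALF_HOUR):
    """Group log entries into query chains.

    log: iterable of (ip, timestamp_seconds, query) tuples.
    Returns a list of chains, each a list of (timestamp, query) pairs
    from one IP address in chronological order.
    """
    by_ip = defaultdict(list)
    for ip, ts, query in log:
        by_ip[ip].append((ts, query))

    chains = []
    for ip in sorted(by_ip):
        entries = sorted(by_ip[ip])
        current = [entries[0]]
        for prev, cur in zip(entries, entries[1:]):
            if cur[0] - prev[0] <= window:
                current.append(cur)  # same chain: gap within the window
            else:
                chains.append(current)  # gap too long: start a new chain
                current = [cur]
        chains.append(current)
    return chains


if __name__ == "__main__":
    log = [
        ("1.1.1.1", 0, "special collections"),
        ("1.1.1.1", 600, "rare books"),
        ("1.1.1.1", 5000, "hours"),
        ("2.2.2.2", 100, "lexis nexus"),
    ]
    for chain in segment_query_chains(log):
        print(chain)
```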

However, to gain some insight into the properties of query
chains we trained a linear SVM using the same data and
computed the total weight on each feature. The features
with largest positive weight were CosineDistance(q1, q2),
which measures the cosine distance between q1 and q2, and
CosineDistance(doc ids of r1', doc ids of r2'), which measures
the overlap between the documents in the top 10 results.
This indicates that if two queries are similar, or if
they retrieve many of the same documents, then they are
more likely to be in the same query chain. The feature with
largest negative weight measures the minimum number of
results returned by either query normalized between 0 and
1, NormalizedMin(|r1|, |r2|). This indicates that if one of
the queries returns few results, the queries are more likely to
be in a query chain. Our interpretation is that if q1 returns
no results, the user is more likely to run a second query.

We conclude that it is possible to segment log files into 
query chains with reasonable accuracy. 

7. LEARNING RANKING FUNCTIONS 

Given log files recording user behavior on a web search 
engine, we have shown how to transform the log records into 
preference judgments in Section 4 after identifying query
chains using the method from Section 6. Next, we present
an algorithm to learn from these preferences, which we then 
evaluate using the Osmot search engine described earlier. 

We assume as input preference judgments over documents
$d_i$ and $d_j$ for a given query $q$ to be of the following form:

\[ d_i >_q d_j \tag{1} \]

Such a preference judgment indicates that $d_i$ is preferred
over $d_j$ given $q$. As our retrieval model, we chose a linear
retrieval function:

\[ \mathrm{rel}(d_i, q) = w \cdot \Phi(d_i, q) \tag{2} \]

where $\Phi(d_i, q)$ (which we define later) is a function that maps
documents and queries to a feature vector. Intuitively, it can
be thought of as a feature vector describing the quality of
the match between a document $d_i$ and the query $q$. $w$ is
a weight vector that assigns weights to each of the features
in $\Phi$, thus giving us a real valued retrieval function where
a higher score indicates a document $d_i$ is estimated to be
more relevant to the query $q$. The task of learning a ranking
function becomes one of learning an optimal $w$.
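
The linear retrieval function amounts to a dot product followed by a sort. The sketch below is illustrative only; the function names and the use of plain Python lists for the weight and feature vectors are assumptions.

```python
def rel(w, phi):
    """Score a document: dot product of weights and feature vector."""
    return sum(wi * xi for wi, xi in zip(w, phi))


def rank_documents(w, features_by_doc):
    """Return document ids sorted by decreasing score.

    features_by_doc: dict mapping document id -> feature vector for
    one fixed query.
    """
    return sorted(features_by_doc,
                  key=lambda d: rel(w, features_by_doc[d]),
                  reverse=True)


if __name__ == "__main__":
    w = [1.0, 0.5]
    feats = {"a": [0, 1], "b": [1, 0], "c": [1, 1]}
    print(rank_documents(w, feats))
```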

7.1 Ranking SVMs 

We used a modified ranking SVM to learn $w$ in Equation
2. Here, we briefly introduce ranking SVMs [15], which generalize
ordinal regression SVMs [13]. We start by rewriting
Equation 1 as:

\[ w \cdot \Phi(d_i, q) > w \cdot \Phi(d_j, q) \]

We then add a margin, and non-negative slack variables to
allow some of the preference constraints to be violated, as
is done with classification SVMs. This yields a preference
constraint over $w$:

\[ w \cdot \Phi(d_i, q) \geq w \cdot \Phi(d_j, q) + 1 - \xi_{ij} \]

Although we cannot efficiently find a $w$ that minimizes the
number of violated constraints, we can minimize an upper
bound on the number of violated constraints, $\sum \xi_{ij}$.
Simultaneously maximizing the margin leads to the following
convex quadratic optimization problem:

\[
\begin{aligned}
\min_{w,\, \xi_{ij}} \quad & \frac{1}{2}\, w \cdot w + C \sum_{ij} \xi_{ij} \\
\text{subject to} \quad & \forall (q, i, j):\ w \cdot \Phi(d_i, q) \geq w \cdot \Phi(d_j, q) + 1 - \xi_{ij} \\
& \forall i, j:\ \xi_{ij} \geq 0
\end{aligned}
\tag{3}
\]

We will later add more constraints to the optimization prob- 
lem taking advantage of prior knowledge in the learning to 
rank setting. 
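
For intuition, the optimization problem can be approximated without a quadratic programming solver. The sketch below is not the solver used in this work: it substitutes plain subgradient descent on the equivalent unconstrained hinge-loss objective, 0.5 w·w + C Σ max(0, 1 - w·(Φ(d_i,q) - Φ(d_j,q))). The function name, learning rate and epoch count are arbitrary illustrative choices.

```python
def train_ranking_svm(pairs, dim, C=1.0, epochs=200, lr=0.01):
    """Learn w from preference pairs by subgradient descent.

    pairs: list of (phi_preferred, phi_other) feature-vector pairs,
           each vector of length dim.
    """
    w = [0.0] * dim
    # Each preference contributes one difference vector phi_i - phi_j.
    diffs = [[a - b for a, b in zip(p_pref, p_other)]
             for p_pref, p_other in pairs]
    for _ in range(epochs):
        grad = list(w)  # gradient of the regulariser 0.5 * w.w
        for d in diffs:
            margin = sum(wi * di for wi, di in zip(w, d))
            if margin < 1.0:  # constraint violated or inside the margin
                for k in range(dim):
                    grad[k] -= C * d[k]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


if __name__ == "__main__":
    pairs = [([1, 0], [0, 1]), ([1, 1], [0, 1])]
    print(train_ranking_svm(pairs, dim=2))
```

A preference pair is satisfied by the learned function when the dot product of `w` with the pair's difference vector is positive.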



7.2 Retrieval Function Model 

Next we must specify the mapping $\Phi(d_i, q)$. This definition
is key in determining what class of ranking functions we
can learn, and is therefore particularly important in determining
the usefulness of this method. We define two types of
features: rank features $\phi^f_{rank}(d, q)$ and term/document features
$\phi_{terms}(d, q)$. Rank features serve to exploit the existing
retrieval functions $\mathrm{rel}^f$, while term/document features
allow us to learn more fine-grained relationships between
particular query terms and specific documents.

First we need a few definitions. Let $T := \{t_1, \ldots, t_N\}$
be all the terms (words) in our dictionary. A query $q$ is
a set of terms $q := \{t'_1, \ldots, t'_n\}$ where $t'_i \in T$. Let
$D := \{d_1, \ldots, d_M\}$ be the set of all documents in our
collection. We assume the original search engine has a number
of available retrieval functions $\mathrm{rel}^f(d, q)$ with $f \in F$. We
define $r^f(q)$ as the ordered set of results as ranked by $\mathrm{rel}^f$
for query $q$. In the experiments in this paper, $F$ consists of
a single ranking function as provided by Nutch for the sake
of simplicity.

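Now,

\[
\Phi(d, q) =
\begin{bmatrix}
\phi^{f_1}_{rank}(d, q) \\
\vdots \\
\phi^{f_{|F|}}_{rank}(d, q) \\
\phi_{terms}(d, q)
\end{bmatrix}
\]

\[
\phi^{f_i}_{rank}(d, q) =
\begin{bmatrix}
\mathbf{1}(\mathrm{Rank}(d \text{ in } r^{f_i}(q)) \le 1) \\
\vdots \\
\mathbf{1}(\mathrm{Rank}(d \text{ in } r^{f_i}(q)) \le 10) \\
\mathbf{1}(\mathrm{Rank}(d \text{ in } r^{f_i}(q)) \le 15) \\
\vdots \\
\mathbf{1}(\mathrm{Rank}(d \text{ in } r^{f_i}(q)) \le 100)
\end{bmatrix}
\]

\[
\phi_{terms}(d, q) =
\begin{bmatrix}
\mathbf{1}(d = d_1 \wedge t_1 \in q) \\
\vdots \\
\mathbf{1}(d = d_M \wedge t_N \in q)
\end{bmatrix}
\]

where $\mathbf{1}$ is the indicator function.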

Before looking at the term features $\phi_{terms}(d, q)$, let's
explore the rank features $\phi^{f_i}_{rank}(d, q)$. For each retrieval
function $rel_{f_i}$ we have 28 rank features (for ranks 1, 2, ..., 10,
15, 20, ..., 100). Each of these is set to 1 if the rank of the
document in $r^{f_i}(q)$ is at or above the specified rank (that is,
numerically no greater than it).

The rank features allow us to learn weights for the rankings
of the original search results. This allows the learned
ranking function to combine different retrieval functions with
different weights, as is done in prior work described earlier.
We do not consider the specific scores assigned by $rel_{f_i}$ in
order to account for potentially different magnitudes of the
scores from different retrieval functions. This also ensures
that our method could generalize to settings where we do not
have access to the scores assigned to documents but only the
document ranks. As an example, if some document $d$ is at
rank 4 given query $q$ and using retrieval function $f_1$, then
$\phi^{f_1}_{rank}(d, q) = [0, 0, 0, 1, \ldots, 1]^T$. If a document is not
ranked in the top 100 by the retrieval function $f_1$, then all
the features of $\phi^{f_1}_{rank}$ are 0. This means that documents not
ranked in the top 100 results by a retrieval function $rel_{f_i}$
are indistinguishable using the $\phi^{f_i}_{rank}$ features (although we
could increase the maximum rank considered arbitrarily).
We chose this cutoff as it is extremely rare for users to look
beyond the top 100 results.
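
As an illustration of the rank features, the sketch below builds the 28-dimensional indicator vector for one document under a single retrieval function. The threshold list follows the ranks named in the text; the function name, the list-based representation of the ranking and the document identifiers are assumptions made for the example.

```python
# Rank thresholds 1, 2, ..., 10, 15, 20, ..., 100 (28 in total).
RANK_THRESHOLDS = list(range(1, 11)) + list(range(15, 101, 5))


def rank_features(doc, ranking):
    """28 indicator features for doc under one retrieval function.

    ranking is the ordered result list for the query. A document that
    is absent, or ranked below 100, gets the all-zero vector.
    """
    if doc not in ranking:
        return [0] * len(RANK_THRESHOLDS)
    rank = ranking.index(doc) + 1  # ranks are 1-based
    return [1 if rank <= t else 0 for t in RANK_THRESHOLDS]


# Hypothetical result list with 120 documents named doc1 ... doc120.
results = ["doc%d" % i for i in range(1, 121)]
features_rank4 = rank_features("doc4", results)
```

For a document at rank 4 the first three entries are 0 and the remaining 25 are 1, matching the example in the text.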

    ranking r:         d_1, d_2, d_3, d_4
    ranking r':        d_2, d_5, d_1, d_6
    combined(r, r'):   d_1, d_2, d_5, d_3, d_4, d_6

Figure 6: Two example rankings with four results
each, and the combined outputs we would generate
by starting with the top ranked document from
ranking r.

We also have $NM$ term/document features. For convenience,
let $\phi^{d_i,t_j}_{term}(d, q)$ correspond to the entry with $d_i$ and $t_j$
in $\phi_{terms}(d, q)$. There is one for every (term, document) pair
in $T \times D$. The term/document features allow the ranking
function to learn associations between specific query words
and documents by assigning a non-zero value to the appropriate
weight. This is usually an extremely large number of
features, although most never appear in our training data
and can thus be ignored. Furthermore, the feature vector
$\phi_{terms}(d, q)$ is very sparse. For any particular document
$d \in D$, given a query with $|q|$ terms, only $|q|$ of the
$\phi^{d_i,t_j}_{term}(d, q)$ features are set to 1. Specifically, only the entries
for one $i$ value (where $d = d_i$) and with $t_j \in q$ are non-zero.
The sparsity makes this problem well suited for solving using
support vector machines. A positive value of the weight
$w^{d_i,t_j}_{term}$, associated with the feature $\phi^{d_i,t_j}_{term}$, indicates that $d_i$
is more likely to be relevant to queries containing the term
$t_j$, while a negative value means the opposite.
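
Because only $|q|$ term/document features are active for any document, they are naturally stored sparsely. The sketch below keeps the weights in a dictionary keyed by (document, term) and sums the active ones. The two weights are taken from Table 5; the dictionary layout and function names are assumptions for illustration only.

```python
def term_features(doc, query_terms):
    """Sparse term/document features: (document, term) -> 1 per query term."""
    return {(doc, term): 1 for term in set(query_terms)}


def term_score(w_term, doc, query_terms):
    """Contribution of the term/document weights to rel(d, q)."""
    active = term_features(doc, query_terms)
    # Pairs never seen in training have no entry and contribute 0.
    return sum(w_term.get(key, 0.0) * value for key, value in active.items())


# Two learned weights from Table 5; every other weight defaults to 0.
w_term = {
    ("Lexis-Nexis Academic Universe", "lexus"): 22.8,
    ("Library Research Workshops", "instruction"): -18.3,
}
score = term_score(w_term, "Lexis-Nexis Academic Universe", ["lexus"])
```

Storing only the instantiated pairs keeps memory proportional to the number of (document, term) pairs that actually occur in the training preferences rather than to $NM$.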

7.3 Adding Prior Knowledge 

When learning to rank, we have additional prior knowl- 
edge that should be incorporated into this problem. Absent 
any other information, documents with a higher rank in the 
original ranking should be ranked higher in the learned rank- 
ing system. This is intuitive because on average we would 
expect the document relevance to be a decreasing function of 
the original rank of the documents, unless the original rank- 
ing function is particularly poor. We define such additional 
constraints in this section. 

It is also of practical importance to add these constraints:
in our training data almost all of the relevance judgments
generated state that a lower ranked document is preferred to
a higher ranked document. Without additional constraints,
a trivial and undesirable solution to the optimization problem
in Equation 3 would be one that reverses the original
ranking by assigning a negative value to each of the weights
corresponding to rank features in $\Phi$. To see this, consider
again Figure 1. The "Click $>_{q\,(q')}$ Skip Above" preferences
would be satisfied if the rankings were reversed. These
preferences are much more common than "Click First $>_{q\,(q')}$
No-Click Second" preferences. In the last two preference
classes, the preferred document is also presumably somewhere
much lower in the results for $q$ (if it is not in the
results, we can think of it as being at the bottom of the
results), and hence the preferences would also be satisfied if
the entire ranking were reversed.

We add additional hard constraints to the optimization
problem specified in Equation 3. These constraints require
that the weight for each of the rank features must be at
least a constant positive value $w_{min}$:

\[ \forall i \in [1, 28|F|]: \; w_i \ge w_{min} \tag{4} \]



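Intuitively, $w_{min}$ limits how quickly the original ranking
is changed by training data. To see this, briefly consider
a setting where we have a single ranking function $f$ and a
query $q = t'$ that returns at least 100 results. Let $d_i$ be the
document ranked at position $i$ in $r^{f}(q)$. In this case,

\[
\begin{aligned}
\phi^{f}_{rank}(d_{100}, q) &= [0, 0, \ldots, 0, 1]^T \\
\phi^{f}_{rank}(d_{95}, q) &= [0, 0, \ldots, 1, 1]^T \\
&\;\;\vdots \\
\phi^{f}_{rank}(d_{1}, q) &= [1, 1, \ldots, 1, 1]^T
\end{aligned}
\]

Calling the part of $w$ that corresponds to rank features
$w_{rank}$, from Equation 4 we then get

\[
\begin{aligned}
w_{rank} \cdot \phi^{f}_{rank}(d_{100}, q) &\ge w_{min} \\
w_{rank} \cdot \phi^{f}_{rank}(d_{95}, q) &\ge 2 w_{min} \\
&\;\;\vdots \\
w_{rank} \cdot \phi^{f}_{rank}(d_{1}, q) &\ge 28 w_{min}
\end{aligned}
\]

Now say we have a document $d$ that is preferred over $d_1$
but is not in the original results. $d$ would be ranked highest
if $rel(d, q) > rel(d_1, q)$. We know from Section 7.2 that
only $\phi^{d,t'}_{term}(d, q)$ is non-zero in $\Phi(d, q)$. Expanding and
simplifying, this would imply:

\[ w_{terms} \cdot \phi_{terms}(d, q) > 28 w_{min} + w_{terms} \cdot \phi_{terms}(d_1, q) \]
\[ w^{d,t'}_{term} > 28 w_{min} + w^{d_1,t'}_{term} \]

where $w^{d_i,t_j}_{term}$ corresponds to $\phi^{d_i,t_j}_{term}(d, q)$.

The larger $w_{min}$, the larger in magnitude $w^{d,t'}_{term}$ and $w^{d_1,t'}_{term}$
must be before this happens. A ranking SVM minimizes
$\frac{1}{2}\, w \cdot w + C \sum \xi_{ij}$, so these weights will only become
large if there is sufficient training data to support a reordering.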
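
The effect of $w_{min}$ can be checked numerically. The sketch below computes the lower bound that Equation 4 places on the rank-feature score of a document at a given rank, and from it the value that the term weight of a previously unranked document must exceed before it can overtake $d_1$ for a single-term query. The function names are hypothetical, and the threshold is a necessary condition only, since the learned rank weights may be larger than $w_{min}$.

```python
RANK_THRESHOLDS = list(range(1, 11)) + list(range(15, 101, 5))  # 28 values


def min_rank_score(rank, w_min):
    """Lower bound on the rank-feature score of a document at this rank.

    Every active rank feature has weight at least w_min (Equation 4),
    so the bound is w_min times the number of active features.
    """
    active = sum(1 for t in RANK_THRESHOLDS if rank <= t)
    return active * w_min


def term_weight_needed(w_min, w_term_d1=0.0):
    """Value the term weight of an unranked document must exceed to beat d_1."""
    return min_rank_score(1, w_min) + w_term_d1


bound_top = min_rank_score(1, w_min=1.0)     # all 28 features active
bound_last = min_rank_score(100, w_min=1.0)  # only the last feature active
```

With $w_{min} = 1$, a new document needs a term weight above 28 plus the term weight of $d_1$, so a handful of noisy preferences is unlikely to displace the original top result.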

7.4 Evaluation Methodology 

In order to evaluate our results, we need an unbiased 
method for comparing two ranked retrieval functions. For 
this purpose we use the method detailed in [16]. This method
was shown to give an accurate assessment of retrieval quality 
under reasonable assumptions. Given two ranking functions, 
we present users with a combination of the results from both. 
We know that users scan results from top to bottom, so we 
intertwine the results such that there is no presentation bias 
favoring either ranking function. This evaluation method is 
built into the Osmot search engine. 

Figure 6 shows two example rankings, r and r', from
two different retrieval functions as well as a combination
of them, combined(r, r'). Let seen(n, r) and seen(n, r') be
the number of results the user has seen from rankings r and
r' respectively after looking at the top n results from the
combined ranking. seen(n, r) and seen(n, r') are defined
as the smallest number of results that we have to combine
from r and r' to produce the top n results of the combined
ranking. We generate the combined ranking such that for
any n, seen(n, r) $\ge$ seen(n, r') $\ge$ seen(n, r) - 1. In our
example, if the user looks at the top three results in the
combined ranking, this is satisfied because seen(3, r) = 2
and seen(3, r') = 2. If the user looks at the top five results,
seen(5, r) = 4 and seen(5, r') = 3. To compensate for a bias
toward the results of r (seen(n, r) is sometimes one bigger
than seen(n, r')), we randomly switch r and r' half the time.
This means that in expectation seen(n, r) = seen(n, r').
The property is proved rigorously in [16].
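
The combination step can be sketched as follows. This is a minimal Python illustration that is consistent with the description above and with the example rankings of Figure 6, here written out as r = (d1, d2, d3, d4) and r' = (d2, d5, d1, d6); it is not necessarily identical to the procedure of [16]. The function name `combine` and the returned `seen` list are assumptions made for the example.

```python
def combine(r, r_prime):
    """Interleave two rankings, starting with the top document of r.

    Returns (combined, seen), where seen[n - 1] == (seen(n, r), seen(n, r'))
    is the number of results consumed from each ranking to produce the
    top n results of the combined ranking.
    """
    combined, seen = [], []
    ka = kb = 0  # results consumed so far from r and from r'
    while ka < len(r) or kb < len(r_prime):
        # Take from r when both pointers are level, or when r' is exhausted.
        if kb >= len(r_prime) or (ka < len(r) and ka <= kb):
            doc = r[ka]
            ka += 1
        else:
            doc = r_prime[kb]
            kb += 1
        if doc not in combined:  # skip documents that are already shown
            combined.append(doc)
            seen.append((ka, kb))
    return combined, seen


r = ["d1", "d2", "d3", "d4"]
r_prime = ["d2", "d5", "d1", "d6"]
combined, seen = combine(r, r_prime)
```

To remove the bias toward the first argument, a caller would pass the two rankings in swapped order for a randomly chosen half of the queries. For rankings of equal length every entry of `seen` satisfies seen(n, r) $\ge$ seen(n, r') $\ge$ seen(n, r) - 1.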

Once we have presented the user with a combined ranking, 
we need to evaluate which of the two rankings is preferred. 
We first determine which results the user looked at by taking 
the lowest ranked clicked-on document as where the user 
stopped scanning the results (a conservative estimate). If 
the two rankings are equally good, we would expect the 
user to click on just as many results from each given that 
she has seen an equal number from each (in expectation). 
We measure clicks(r), the number of documents clicked on 
that are in the top seen(n, r) results of r, and similarly 
clicks(r'). For example, in Figure 6, say the user clicked
on $d_1$ and $d_5$. We would infer the user looked at the top 3
results. From before, we have seen(3, r) = seen(3, r') = 2.
Therefore, clicks(r) = 1 ($d_1$) and clicks(r') = 1 ($d_5$).

If in expectation clicks(r) > clicks(r'), we can conclude 
that the user prefers the ranking r over r' . When evaluating 
ranking functions, we count how often clicks(r) > clicks(r'), 
and clicks(r) < clicks(r'). We then use a binomial sign test
to verify if the difference in counts of clicks(r) > clicks(r') 
and clicks(r) < clicks(r') is statistically significant. If so, 
we can say one ranking is preferred over the other. 
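
The per-query comparison can be written down directly. The sketch below counts the clicked documents among the results the user is assumed to have seen from each ranking and reports which ranking, if any, received more clicks. The function names and the string labels for the outcome are illustrative assumptions; the seen counts are passed in as given values.

```python
def clicks_in_top(ranking, k, clicked):
    """Number of clicked documents among the top k results of a ranking."""
    return sum(1 for doc in ranking[:k] if doc in clicked)


def compare_query(r, r_prime, seen_r, seen_r_prime, clicked):
    """Return 'r', "r'" or 'tie' for one query, from clicks(r) and clicks(r')."""
    clicks_r = clicks_in_top(r, seen_r, clicked)
    clicks_r_prime = clicks_in_top(r_prime, seen_r_prime, clicked)
    if clicks_r > clicks_r_prime:
        return "r"
    if clicks_r < clicks_r_prime:
        return "r'"
    return "tie"


# Worked example from the text: clicks on d1 and d5, so the user is
# assumed to have looked at the top 3 combined results (seen = 2 and 2).
r = ["d1", "d2", "d3", "d4"]
r_prime = ["d2", "d5", "d1", "d6"]
outcome = compare_query(r, r_prime, 2, 2, clicked={"d1", "d5"})
```

Tallying these outcomes over all queries gives the counts that enter the sign test; ties carry no preference information.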

7.5 Training the Ranking SVM 

We collected training data from the CUL search engine
using the original ranking function between June and
December 2004. During this time, we recorded user queries
and clicks, observing 9,949 queries and 7,429 clicks. While
we were collecting this data, the users saw results as ranked
by the built-in Nutch retrieval function, which we denote as
$rel_0$. This gave 120,134 preference constraints by applying
all six strategies introduced above. We call these preferences
$P_{QC}$. Of these, 45,610 preferences were generated without
using the query chain strategies. We call this subset of the
preferences $P_{NC}$.

After adding the hard constraints as described above, we
trained a ranking SVM for each of the two sets of preferences
with a linear kernel and a default value of $C$ using
SVM$^{light}$ [14]. We set $w_{min} = 1$. Using the preferences
$P_{QC}$ we learned a retrieval function $rel_{QC}$ and using the
preferences $P_{NC}$ we learned $rel_{NC}$. The former model has
41,354 support vectors, while the latter has 18,034.

The ranking model learned using query chains, $rel_{QC}$,
instantiated 18,748 features. The number of features
instantiated can be expected to grow almost linearly in the size of
the document collection, and sub-linearly in the amount of
training data collected (depending on overall user search
behavior). However, this did not pose a problem for the SVM
solver because all the preference judgments were sparse.



Evaluation Mode       User Prefers   User Prefers   Indifferent
                      Chains         Other
rel_QC vs. rel_0      392 (32%)      239 (20%)      579 (47%)
rel_QC vs. rel_NC     211 (17%)      160 (13%)      855 (70%)

Table 3: Results on Cornell Library search engine.
$rel_0$ is the original retrieval function, $rel_{QC}$ is that
trained using query chains, and $rel_{NC}$ is that trained
without using query chains.



7.6 Results and Discussion 

We evaluated the ranking functions on the CUL search engine
from 10 December 2004 through 18 February 2005 using
the evaluation method described in Section 7.4. When a
user connected to the search engine, we randomly selected
an "evaluation mode" for that user. The user either saw a
ranking combining $rel_0$ and $rel_{QC}$ or a ranking combining
$rel_{QC}$ and $rel_{NC}$. For consistency, we kept the same
combination for the duration of each user's session (otherwise, if
the user immediately re-ran the same query he or she may
confusingly get different results).

During the evaluation, we collected about 1200 queries
in each evaluation mode. The results for both evaluation
modes are shown in Table 3. These results show a number
of interesting properties. Firstly, 53% of the time $rel_{QC}$,
the ranking function trained using query chains, performs
differently from the original ranking function, $rel_0$. The
two trained ranking functions perform differently 30% of
the time. In particular, the first of these values indicates that
our method often makes a difference in search engine
performance. Given that the original ranking function is
reasonable, it would be surprising if these values were much
higher. As long as our method does not cause relevant
documents that are ranked highly by $rel_0$ to be lowered in rank,
we would see identical performance in the cases when $rel_0$
performs well.

Secondly, from Table 3 we see that $rel_{QC}$ outperforms $rel_0$
more often than we would expect at random were the two 
ranking functions equally good. Using a binomial sign test, 
and the null hypothesis that the two ranking functions are 
equally effective, we are able to reject the null hypothesis 
with over 99% confidence. This establishes that our learned 
ranking function is an improvement over the original one. Of 
course, given the new ranking function, we are collecting new 
training data and can re-run the whole learning process. We 
expect this to produce continued improvement in ranking 
performance. 

Finally, the model trained using query chains outperforms 
the model trained without using query chains with over 99% 
confidence, using the same test. This demonstrates that by 
exploiting the information about query chains present in log 
files, we are able to see a measurable additional improve- 
ment in search engine performance over what we would see 
without using this extra information. 
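
The significance claims can be reproduced from the counts in Table 3 with an exact binomial sign test. The sketch below is a minimal Python version under two assumptions that the text does not spell out: ties ("Indifferent" queries) are discarded, and the test is one-sided. The function name is hypothetical.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """One-sided exact binomial sign test, ignoring ties.

    Probability of observing at least `wins` successes in wins + losses
    fair coin flips, that is, under the null hypothesis that both
    ranking functions are equally likely to be preferred.
    """
    n = wins + losses
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return tail / 2 ** n


# Preference counts from Table 3 ("Indifferent" queries are dropped).
p_qc_vs_orig = sign_test_p_value(392, 239)
p_qc_vs_nc = sign_test_p_value(211, 160)
```

Under these assumptions both p-values fall below 0.01, in line with rejecting the null hypothesis with over 99% confidence in both evaluation modes.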

Word            Fraction of queries
of              3.56%
library         2.75%
bibliography    2.60%
and             2.55%
annotated       2.42%
reserve         2.32%
citation        1.99%
web             1.48%
the             1.41%
course          1.33%

Table 4: The most common words to appear in
queries in the training data, and the fraction of
queries in which they occur.

Word          Document                          Weight
lexus         Lexis-Nexis Academic Universe       22.8
ebook         CUL eContent Collection             22.5
reuleaux      CUL Digital Collections             21.8
and           Printable News and Notes 07/03      19.6
oed           Dictionaries and Encyclopedias      19.5
ndlf          Management meeting notes 03/97     -21.0
ndlf          Management meeting notes 02/97     -20.6
ndlf          Management meeting notes 04/96     -19.5
ndlf          Management meeting notes 04/96     -18.6
instruction   Library Research Workshops         -18.3

Table 5: Five most positive and most negative feature
weights in the ranking function learned using
query chains on the Cornell University Library
(CUL) search engine.

One may wonder if it makes sense to learn associations
between specific query words and documents. Given our initial
9,949 training queries, Table 4 shows the ten words that
appear most frequently in queries. We see that queries tend
to be repetitive. Ignoring the three stopwords in the top ten
words, we found that at least one of the remaining seven
words appears in 12% of all queries. At least one of the
top 100 words (removing stopwords) appears in 38% of all
queries. Moreover, for many popular queries, there appear
to be only a few documents that are truly relevant to the
query. Hence it is not surprising that by learning individual
query word/document associations we can see significant
improvements in ranking results.
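
The coverage figures quoted for Table 4 are fractions of queries that contain at least one word from a given set. A minimal sketch of that computation is shown below; the function name, the whitespace tokenization and the four-query log are hypothetical and are not taken from the CUL data.

```python
def fraction_containing(queries, words):
    """Fraction of queries that contain at least one of the given words."""
    words = set(words)
    hits = sum(1 for q in queries if words & set(q.lower().split()))
    return hits / len(queries)


# Hypothetical mini query log, for illustration only.
queries = [
    "annotated bibliography",
    "course reserve",
    "library hours",
    "oed",
]
coverage = fraction_containing(queries, ["library", "bibliography", "reserve"])
```

Applied to a real query log with the seven non-stopword entries of Table 4, the same function would yield the kind of coverage figure reported above.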

In order to understand where the improvements are com- 
ing from, it is useful to look at the word/document features 
with largest positive and negative weights. The top and 
bottom five features are given in Table 5. First we consider
the top five features, which for the most part describe
very sensible associations. The feature for "lexus" is asso- 
ciated with the main homepage of the Lexis-Nexis library 
resource. This is clearly a spelling correction, with a search 
for "lexus" originally returning no results. The same search 
now places the correct document at the top of the results. 
The feature for "ebook" returns the main ebooks web page. 
A search for ebook previously returned seven results, none of 
which were particularly useful. The top one, titled "Answers 
to Frequent Job Searching Research Questions", happened 
to mention access to ebooks from off campus. The feature
for "reuleaux" is associated with an FAQ page about
the CUL digital collections. The web page provides a clear 
link to a site that describes models designed by Professor 
Reuleaux. This contrasts with the original top result being 
a broken link, and the second result being a newsletter with 
only passing reference to the model collection. The feature 
for "and" is of little practical interest (we did not remove 
stopwords). Finally, the fifth word "oed" is an acronym for
the "Oxford English Dictionary". The associated document
clearly links to it, in contrast with the original top result,
which was an information bulletin showing, among other
things, a set of screenshots of how to get to the OED.

The five features with the most negative weights in Table 5
are equally interesting. Four of them relate to meeting
notes mentioning the National Digital Library Foundation. 
Using the original ranking function, this search generated 
just 6 results with only such meeting notes. With the learned 
system, a search for "ndlf" now returns similar results to a 
search for "National Digital Library Foundation". These 
results appear slightly more useful from the short abstracts 
that are presented. However, we discovered that in fact the 
search engine had not indexed the main NDLF web page. 
We see here that the search system has recognized users 
running chains of queries looking for the NDLF website, 
although none have been successful in finding it. Despite 
this, some of the worst results for this query have indeed 
been pushed down the results list. The fifth feature is harder 
to interpret, but from log files it appears that users looking 
for the Department of Learning and Instruction saw this 
result and repeatedly skipped it. This document used to 
appear as the top result given the query "instruction".

8. CONCLUSIONS AND FUTURE WORK 

In this paper, we have demonstrated that query chains 
can be used to extract useful information from search engine 
log files. After presenting an algorithm to infer preference 
judgments from log files, we showed that the preference
judgments are valid, independent of the learning method.
We then presented a method to identify query chains, and 
an algorithm that uses the preference judgments to learn an 
improved ranking function. The model used for the ranking 
function is more general than in previous work. In partic- 
ular, it allows the algorithm to learn to include new docu- 
ments originally not present in initial search results in the 
learned rankings. The evaluation shows our approach to be 
effective, and that it can learn highly flexible modifications 
to the original search results. The Osmot search engine is
available to the research community (footnote 4).

A natural question that arises in this setting is the tol- 
erance of this method to noise in the training data, partic- 
ularly should users click in malicious ways. While we used 
noisy real-world data, we plan to explicitly study the effect 
of noise, words with two meanings, and click-spam on our 
approach. 

Also, the strategies presented in Section 4.1 give equal
weight to each pair of queries within a query chain. However, 
we suspect that there is additional information present in 
the position of a query within a chain, and of a click within 
the sequence of all clicks for each chain. In particular, it 
is possible that the last query and last clicks may be more 
informative than earlier ones. 

Thirdly, exploiting the fact that it is possible to collect 
virtually unlimited amounts of search engine log data, we 
believe that the methods presented in this paper can be 
extended to learn personalized ranking functions. We are 
currently refining the Osmot search engine and will use it 
on the arXiv.org e-Print archive (footnote 5) in order to conduct such
experiments. 

Finally, from a practical perspective our approach pushes
the limit of problems that current SVM implementations can
solve in reasonable time due to the number of constraints we
generate. We believe there is room for improvement in learning
methods to efficiently deal with such large numbers of
constraints, for example using an incremental optimization
approach. Perhaps there are also alternative learning methods,
rather than SVMs, that can be used to optimize over
preference constraints while being able to learn sufficiently
general ranking functions.

Footnote 4: http://www.cs.cornell.edu/~filip/osmot/
Footnote 5: http://www.arxiv.org/